pacman::p_load(ggiraph, plotly, patchwork, DT, tidyverse,knitr,FunnelPlotR,scales) Take Home Exercise 1
Visual Analytics of the demographic and financial characteristics of residents in City of Engagement
1. OVERVIEW
City of Engagement, with a total population of 50,000, is a small city located at Country of Nowhere. The city serves as a service centre of an agriculture region surrounding the city. The main agriculture of the region is fruit farms and vineyards. The local council of the city is in the process of preparing the Local Plan 2023. A sample survey of 1000 representative residents had been conducted to collect data related to their household demographic and spending patterns, among other things. The city aims to use the data to assist with their major community revitalization efforts, including how to allocate a very large city renewal grant they have recently received.
1.1 The Task
In this take-home exercise, you are required to apply the concepts and methods you had learned in Lesson 1-4 to reveal the demographic and financial characteristics of the city of Engagement, using appropriate static and interactive statistical graphics methods.
2. Data Source
3. Data Preparation
3.1 Install R-packages
Using p_load() of pacman package to load the required libraries
3.2 Import Data
3.2.1 Import participants dataset
participants <- read_csv("data/Participants.csv")3.2.2 Load participants
# A tibble: 6 × 7
participantId householdSize haveKids age educationLevel interestGroup
<dbl> <dbl> <lgl> <dbl> <chr> <chr>
1 0 3 TRUE 36 HighSchoolOrCollege H
2 1 3 TRUE 25 HighSchoolOrCollege B
3 2 3 TRUE 35 HighSchoolOrCollege A
4 3 3 TRUE 21 HighSchoolOrCollege I
5 4 3 TRUE 43 Bachelors H
6 5 3 TRUE 32 HighSchoolOrCollege D
# ℹ 1 more variable: joviality <dbl>
head(participants)3.2.3 Import Financial Journal dataset
financial_journal <- read_csv("data/FinancialJournal.csv")3.2.4 Load Financial Journal
# A tibble: 6 × 4
participantId timestamp category amount
<dbl> <dttm> <chr> <dbl>
1 0 2022-03-01 00:00:00 Wage 2473.
2 0 2022-03-01 00:00:00 Shelter -555.
3 0 2022-03-01 00:00:00 Education -38.0
4 1 2022-03-01 00:00:00 Wage 2047.
5 1 2022-03-01 00:00:00 Shelter -555.
6 1 2022-03-01 00:00:00 Education -38.0
head(financial_journal)3.3 Data Wrangling
As seen from the two data tables, there are inaccurate data types. Code with mutate from dplyr to reformat participantID from dbl to chr. groupby participantID remove duplicates reformat timestamp to year and month create new variables for income, expenses, cashflow
The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame.
unique(financial_journal$category)[1] "Wage" "Shelter" "Education" "RentAdjustment"
[5] "Food" "Recreation"
#create new dataset
participants_new <- participants %>%
mutate(
participantId = as.character(participantId))
#reformat householdSize to Ordinal
participants_new$householdSize <- factor(participants$householdSize,
levels = c("1", "2", "3"),
ordered = TRUE)
#reformat age group
participants_new$age_group <- factor(ifelse(participants$age < 20, "Under 20",
ifelse(participants$age < 30, "20-29",
ifelse(participants$age < 40, "30-39",
ifelse(participants$age < 50, "40-49", "Above 50")))),
levels = c("Under 20", "20-29", "30-39", "40-49", "Above 50"),
ordered= TRUE)
#reformat education level to Ordinal
participants_new$educationLevel <- factor(participants$educationLevel,
levels = c("Low", "Graduate", "Bachelors",
"HighSchoolOrCollege"),
ordered = TRUE)
#round up joviality to 2 decimal places
participants_new$joviality <- round(participants$joviality, 2)
participants_new# A tibble: 1,011 × 8
participantId householdSize haveKids age educationLevel interestGroup
<chr> <ord> <lgl> <dbl> <ord> <chr>
1 0 3 TRUE 36 HighSchoolOrCollege H
2 1 3 TRUE 25 HighSchoolOrCollege B
3 2 3 TRUE 35 HighSchoolOrCollege A
4 3 3 TRUE 21 HighSchoolOrCollege I
5 4 3 TRUE 43 Bachelors H
6 5 3 TRUE 32 HighSchoolOrCollege D
7 6 3 TRUE 26 HighSchoolOrCollege I
8 7 3 TRUE 27 Bachelors A
9 8 3 TRUE 20 Bachelors G
10 9 3 TRUE 35 Bachelors D
# ℹ 1,001 more rows
# ℹ 2 more variables: joviality <dbl>, age_group <ord>
#check min and max age of residents in COE.
min(participants$age)[1] 18
max(participants$age)[1] 60
#remove duplicate rows for all columns
financial_journal_lessdup <- financial_journal %>%
distinct()You can use group_by() function along with the summarise() from dplyr package to find the group by sum in R DataFrame, group_by() returns the grouped_df ( A grouped Data Frame) and use summarise() on grouped df results to get the group by sum.
scales package (part of the Tidyverse) does exactly this:
Show the code
#create new dataset
grouped_data <- financial_journal_lessdup %>%
#recode ID from dbl to chr, year_mth
mutate(participantId = as.character(participantId),
year_mth = format(as.Date(financial_journal_lessdup$timestamp), "%Y-%m"),
amount = abs(round(amount,2)),
.before = 3) %>%
#group the columns in the following order
group_by(participantId,year_mth, category) %>%
summarize(total_amount = sum(amount))
# Pivot the data frame to have categories as columns
pivoted_fj <- grouped_data %>%
pivot_wider(names_from = "category", values_from = "total_amount", values_fill = 0)
# Add a new column with mixed categories
pivoted_fj$Expenses <- pivoted_fj$Education + pivoted_fj$Food + pivoted_fj$Recreation + pivoted_fj$Shelter +pivoted_fj$RentAdjustment
pivoted_fj$Income <- pivoted_fj$Wage
pivoted_fj$Cashflow <- pivoted_fj$Income - pivoted_fj$Expenses
pivoted_fj$Shelter <- pivoted_fj$Shelter + pivoted_fj$RentAdjustment
# Output the pivoted data frame
pivoted_fj# A tibble: 10,691 × 11
# Groups: participantId, year_mth [10,691]
participantId year_mth Education Food Recreation Shelter Wage
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 2022-03 38.0 268. 349. 555. 11932.
2 0 2022-04 38.0 266. 219. 555. 8637.
3 0 2022-05 38.0 265. 383. 555. 9048.
4 0 2022-06 38.0 257. 466. 555. 9048.
5 0 2022-07 38.0 270. 1069. 555. 8637.
6 0 2022-08 38.0 262. 314. 555. 9459.
7 0 2022-09 38.0 256. 295. 555. 9048.
8 0 2022-10 38.0 267. 25.0 555. 8637.
9 0 2022-11 38.0 261 377. 555. 9048.
10 0 2022-12 38.0 266. 357. 555. 9048.
# ℹ 10,681 more rows
# ℹ 4 more variables: RentAdjustment <dbl>, Expenses <dbl>, Income <dbl>,
# Cashflow <dbl>
The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved.
colSums(pivoted_fj[-1] !=0) year_mth Education Food Recreation Shelter
10691 3018 10691 9492 10560
Wage RentAdjustment Expenses Income Cashflow
10691 72 10691 10691 10691
Check for missing values
#Check for missing values
any(is.na(participants_new))[1] FALSE
any(is.na(pivoted_fj))[1] FALSE
Merge Data Table
#join both data sets
resident_profile <- full_join(participants_new, pivoted_fj,
by = c("participantId" = "participantId")) %>%
#relocate columns to the front (by importance)
relocate(year_mth, .after =participantId) %>%
relocate(Cashflow, .after = year_mth) %>%
relocate(age_group, .after = Cashflow) %>%
relocate(educationLevel, .after = age_group) %>%
relocate(Income, .after = haveKids) %>%
relocate(Expenses , .after = Income)
resident_profile %>%
select(c(1:17))# A tibble: 10,691 × 17
participantId year_mth Cashflow age_group educationLevel householdSize
<chr> <chr> <dbl> <ord> <ord> <ord>
1 0 2022-03 10722. 30-39 HighSchoolOrCollege 3
2 0 2022-04 7559. 30-39 HighSchoolOrCollege 3
3 0 2022-05 7808. 30-39 HighSchoolOrCollege 3
4 0 2022-06 7733. 30-39 HighSchoolOrCollege 3
5 0 2022-07 6704. 30-39 HighSchoolOrCollege 3
6 0 2022-08 8291. 30-39 HighSchoolOrCollege 3
7 0 2022-09 7904. 30-39 HighSchoolOrCollege 3
8 0 2022-10 7752. 30-39 HighSchoolOrCollege 3
9 0 2022-11 7817. 30-39 HighSchoolOrCollege 3
10 0 2022-12 7832. 30-39 HighSchoolOrCollege 3
# ℹ 10,681 more rows
# ℹ 11 more variables: haveKids <lgl>, Income <dbl>, Expenses <dbl>, age <dbl>,
# interestGroup <chr>, joviality <dbl>, Education <dbl>, Food <dbl>,
# Recreation <dbl>, Shelter <dbl>, Wage <dbl>
DT::datatable(resident_profile, class= "compact")4. Demographics Insights
p1 <- ggplot(data=resident_profile,
aes(x = age_group)) +
geom_bar_interactive() +
ggtitle("Age group of residents ") +
theme(plot.title = element_text(hjust = 0.5))p2 <- ggplot(data=resident_profile,
aes(x = educationLevel)) +
geom_bar_interactive() +
ggtitle("Education level of residents ") +
theme(plot.title = element_text(hjust = 0.5))p3 <- ggplot(data=resident_profile,
aes(x = interestGroup)) +
geom_bar_interactive() +
ggtitle("Interest group of residents ") +
theme(plot.title = element_text(hjust = 0.5)) girafe(code = print((p1 + p2) / p3),
width_svg = 6,
height_svg = 3,
options = list(
opts_hover(css = "fill: #202020;"),
opts_hover_inv(css = "opacity:0.2;")
)
) 5. Financial Health Insights
plot_ly(data = resident_profile,
x = ~joviality, y = ~Income)p <- ggplot(data=resident_profile,
aes(x = Shelter)) +
geom_dotplot_interactive(
aes(tooltip = Expenses,
data_id = Expenses),
stackgroups = TRUE,
binwidth = 1,
method = "histodot") +
scale_y_continuous(NULL,
breaks = NULL)
girafe(
ggobj = p,
width_svg = 6,
height_svg = 6*0.618,
options = list(
opts_hover(css = "fill: #202020;"),
opts_hover_inv(css = "opacity:0.2;")
)
) average_income <- round(mean(resident_profile$Income),2)
median_income <-round(median(resident_profile$Income),2)
ymax <- as.numeric(round((IQR(resident_profile$Income)*1.5) +
quantile(resident_profile$Income,0.75)))
ymin <- as.integer(min(resident_profile$Income))
average_income[1] 4265.05
median_income[1] 3613.94
ymax[1] 9110
ymin[1] 1600
b <- ggplot(data = pivoted_fj, aes(y = Income)) +
geom_boxplot(outlier.colour="red", outlier.shape=16,
outlier.size=1, notch=FALSE) +
coord_flip() + labs(y = "", x = "") +
scale_y_continuous(limits = c(0,20000), labels = scales::comma) +
theme(axis.text = element_blank(), axis.ticks = element_blank()) +
stat_boxplot(geom="errorbar", width=0.5) +
annotate("text", x=0.35, y=ymax, label=format(ymax, big.mark = ","),
size=3, color="lightpink4") +
annotate("text", x=0.35, y=ymin, label=format(ymin, big.mark = ","),
size=3, color="lightpink4")
b
#plotting histogram
h <- ggplot(data = resident_profile,
aes(x = Income)) +
geom_histogram(color="black", fill="azure4", binwidth = 50000) +
scale_x_continuous(limits = c(0,1500000), labels = scales::comma) +
labs(x = "Resale Price (SGD)", y = "Number of transactions") +
geom_vline(aes(xintercept = resale_mean), col="darkblue", linewidth=1) +
annotate("text", x=640000, y=4000, label="Mean resale price:",
size=4, color="darkblue") +
annotate("text", x=640000, y=3750, label=format(average_income, big.mark = ","),
size=4, color="darkblue") +
geom_vline(aes(xintercept = resale_median), col="lightpink4", linewidth=1) +
annotate("text", x=400000, y=4000, label="Median resale price",
size=4, color="lightpink4") +
annotate("text", x=400000, y=3750, label=format(median_income, big.mark = ","),
size=4, color="lightpink4") +
theme(axis.text.x = element_text(size=8))funnel_plot(
numerator = resident_profile$Income,
denominator = resident_profile$Expenses,
group = resident_profile$age
)
A funnel plot object with 43 points of which 0 are outliers.
Plot is adjusted for overdispersion.
#Initiating the base plot
plot_ly(data = resident_profile,
x = ~joviality,
y = ~age,
color = ~haveKids,
hovertemplate = ~paste("<br>ID", participantId,
"<br>Education Level:", educationLevel,
"<br>Household Size ", householdSize),
type = 'scatter',
mode = 'markers',
marker = list(opacity = 0.6,
sizemode = 'diameter',
line = list(width = 0.2, color = 'white')))